1 Abstract


Collectively, lung cancer, breast cancer and melanoma were diagnosed in
over 535,340 people, and 209,400 deaths were reported [13]. It is
estimated that over 600,000 people will be diagnosed with these forms of
cancer in 2015. Most deaths from lung cancer, breast cancer and
melanoma result from late detection. All of these cancers, if detected
early, are 100% curable. In this study, we develop and evaluate algorithms
to diagnose breast cancer, melanoma and lung cancer. In the first part
of the study, we employed a normalised Gradient Descent and an Artificial
Neural Network to diagnose breast cancer with overall accuracies of
91% and 95% respectively. In the second part of the study, an Artificial
Neural Network coupled with image processing and analysis algorithms was
employed to diagnose melanoma with an overall accuracy of 93%. A naive
mobile application that allows people to take diagnostic tests on their
phones was developed. Finally, a Support Vector Machine algorithm
incorporating image processing and image analysis algorithms was developed
to diagnose lung cancer with an accuracy of 94%. All of the aforementioned
systems had very low false positive and false negative rates.
2 Introduction


Breast cancer, lung cancer and melanoma together account for over 30% 
of the total cases of cancer reported in 2014. With over 200,000 deaths 
last year and an expected 300,000 deaths this year, it is imperative that 
measures are taken to prevent the rise in the deaths occurring collectively 
due to these diseases. 

The 5-year survival rate for lung cancer is 54%. However, if not
diagnosed early, the chances of survival reduce drastically to 16%. The
5-year survival rate for breast cancer is 86%, but it falls to 22% if not
diagnosed early. Finally, the 5-year survival rate for melanoma is 95%,
but it falls to 22% if not diagnosed early. All of these diseases, if
diagnosed early, are 100% curable. To diagnose these diseases early, we
develop and apply novel machine learning algorithms.

There are several benefits to using machine learning algorithms in the
field of biomedicine. They eliminate the added dimension of mistakes
committed due to human carelessness. The extensive amount of data
available online also makes it possible for the algorithms to train
themselves and achieve near perfect accuracy, something that is very
difficult for any human to achieve. Furthermore, the algorithms are easy
to replicate and can be shared globally, thus reducing costs associated
with logistics and labor force.

For this study, machine learning algorithms such as Support Vector
Machine, Artificial Neural Network and Gradient Descent were used. Image
processing and analysis were also used to obtain features from images in
the case of lung cancer and melanoma.

The next step was feature selection. For breast cancer, features such
as smoothness, concavity, fractal dimension, etc. were used from the
WBC database. For lung cancer, image processing and analysis were applied
to CT images from the NCBI database to obtain features such as area,
convex area, solidity, etc. For melanoma, the ABCD standard prescribed
by dermatologists was used. The images were obtained from the PH²
dataset.

Outline. The remainder of this article is organised as follows. Section 3
contains our research, methods and findings for detecting lung cancer.
Section 4 contains our research, methods and findings for detecting
melanoma. Section 5 contains our research, methods and findings for
detecting breast cancer. Section 6 contains our conclusions. Section 7 and
Section 8 present the acknowledgements and the references respectively.
3 Lung Cancer


A Support Vector Machine, along with image processing and image analysis
algorithms, was used in this part of the study.

3.1 Image pre-processing, processing and segmentation 

Total variation denoising, which is effective at preserving edges while
smoothing noise in flat regions, was used in the pre-processing stage.

The optimal thresholding proposed in [17] is applied to the pre-processed
image to obtain a segmentation threshold:

$$T_{i+1} = \frac{\mu_a + \mu_b}{2}$$

In this equation, $T_i$ is the segmentation threshold after the $i$-th
step. To choose a new segmentation threshold, we apply $T_i$ to the image
to separate the pixels into body and non-body pixels. Let $\mu_a$ and
$\mu_b$ be the mean grey levels of the body and non-body pixels
respectively; the new threshold is their average.

Pixels with density lower than the threshold value are assigned the value
1 and the other pixels are assigned the value 0. The remaining non-body
pixels are eliminated through morphological closing.
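
The iterative threshold selection described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the function names, the stopping tolerance `tol`, the starting guess (the overall mean) and the sample grey levels are all assumptions.

```python
def optimal_threshold(pixels, t0=None, tol=0.5, max_iter=100):
    """Iteratively refine a grey-level threshold until it stabilises."""
    # Assumption: start from the overall mean unless a guess is supplied.
    t = sum(pixels) / len(pixels) if t0 is None else t0
    for _ in range(max_iter):
        low = [p for p in pixels if p < t]     # pixels below the threshold
        high = [p for p in pixels if p >= t]   # remaining pixels
        if not low or not high:
            break  # one class is empty, so the threshold cannot be refined
        # New threshold: average of the two class means.
        t_new = 0.5 * (sum(low) / len(low) + sum(high) / len(high))
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
    return t


def binarise(pixels, t):
    """Assign 1 to pixels below the threshold and 0 to the others."""
    return [1 if p < t else 0 for p in pixels]


# Hypothetical grey levels: a dark cluster and a bright cluster.
pixels = [10, 12, 11, 13, 200, 210, 205, 198]
t = optimal_threshold(pixels)
mask = binarise(pixels, t)
```

The morphological closing step that follows the thresholding is not shown here.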
3.2 Feature Selection


After the ROI (Region Of Interest) was obtained, GLCM (Gray Level
Co-occurrence Matrix) was used to extract the features. The features
extracted and the methods used are shown in Table 1 [20].


| Feature | Formula | Function |
|---|---|---|
| Area (A) | - | Total number of pixels in the ROI |
| Convex Area (CA) | - | Total number of pixels in the convex region of the ROI |
| Equivalent Diameter (ED) | $ED = \sqrt{4A/\pi}$ | Diameter of a circle with the same area as the ROI |
| Solidity (S) | $S = A/CA$ | Ratio of Area to Convex Area |
| Energy (E) | $E = \sum_{i=1}^{N}\sum_{j=1}^{N} P(i,j)^2$ | Summation of the squared elements in the GLCM |
| Contrast (C) | $C = \sum_{i=1}^{N}\sum_{j=1}^{N} \lvert i-j \rvert^2 P(i,j)$ | Measure of contrast between intensity of adjacent pixels over the whole ROI |
| Eccentricity (EC) | - | Ratio of distance between the foci of the ellipse and its major axis length |
| Homogeneity (H) | $H = \sum_{i=1}^{N}\sum_{j=1}^{N} \frac{P(i,j)}{1 + \lvert i-j \rvert}$ | Measure of closeness of the distribution of elements in the GLCM to the GLCM diagonal |
| Correlation (CO) | $CO = \frac{\sum_{i=1}^{N}\sum_{j=1}^{N} (ij)\,P(i,j) - \mu_r \mu_c}{\sigma_r \sigma_c}$ | Measure of correlation of a pixel to its neighbor over the ROI |
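
As an illustration of the GLCM-based texture features in Table 1, the sketch below builds a normalised co-occurrence matrix for a single pixel offset and computes energy, contrast and homogeneity from it. The helper names, the choice of a horizontal offset and the tiny two-level ROI are assumptions made for the example; the study itself does not state which offsets or grey-level quantisation were used.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalised grey-level co-occurrence matrix for one pixel offset."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def texture_features(p):
    """Energy, contrast and homogeneity of a normalised GLCM p."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    energy = sum(v * v for _, _, v in cells)
    contrast = sum((i - j) ** 2 * v for i, j, v in cells)
    homogeneity = sum(v / (1 + abs(i - j)) for i, j, v in cells)
    return energy, contrast, homogeneity


# Hypothetical 3x3 ROI quantised to 2 grey levels.
roi = [[0, 0, 1],
       [0, 1, 1],
       [1, 1, 1]]
energy, contrast, homogeneity = texture_features(glcm(roi, levels=2))
```

The code indexes grey levels from 0, which leaves all three features unchanged because they depend only on the difference between the two indices.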
3.3 Support Vector Machine Development

A linear classifier was used. We chose the hyperplane such that the
distance from it to the nearest data point on each side is maximized. The
linear classifier that such a hyperplane defines is known as a
maximum-margin classifier.



Figure 1: An example of a maximum-margin classifier.

The CT images used were obtained from NCBI's online database.
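
A linear maximum-margin classifier of this kind can be approximated with a few lines of code. The sketch below trains a soft-margin linear SVM by subgradient descent on the regularised hinge loss. It is an illustrative stand-in, not the solver used in the study; the hyperparameters `lam`, `lr` and `epochs` and the six (area, convex area) samples are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit w, b by subgradient descent on the regularised hinge loss.

    Labels in y must be +1 or -1.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Regularisation shrinks w, which favours a wider margin.
            w = [wj - lr * lam * wj for wj in w]
            if margin < 1:
                # Point is inside the margin or misclassified: push it out.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Return 1 (malignant) or 0 (benign)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else 0


# Hypothetical, already-scaled (area, convex area) pairs; +1 = malignant.
X = [[0.9, 1.0], [0.8, 0.9], [1.0, 1.1], [0.1, 0.2], [0.2, 0.2], [0.15, 0.3]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
labels = [predict(w, b, x) for x in X]
```

Training uses labels of +1 and -1, while `predict` returns the 0/1 coding used for the classifier output in Section 3.4.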

3.4 Results 

Over 197 images were tested on the SVM. The classifier only chose 2 out 
of the 9 features available for classification at any given time. The most 
reliable features were Area and Convex Area. 

A 70:30 split was used for training and testing of our algorithm.
The data presented here are from the testing phase of the algorithm.

Based on the classification, the tumour was diagnosed as either benign
or malignant. The SVM delivered an accuracy of over 94%, misdiagnosing
only 11 of the images: 5 were false positives and 6 were false negatives.


| Output | Significance |
|---|---|
| 0 | Benign |
| 1 | Malignant |


The overall accuracy delivered by the SVM was, as noted earlier, 94%.
The results have been summarised below (Figure 2).


| Overall Results | Value |
|---|---|
| True positives (TP) | 129 |
| False positives (FP) | 5 |
| True negatives (TN) | 59 |
| False negatives (FN) | 6 |
| Inconclusive | 0 |
| Positive Prediction Rate | 96.23% |
| Negative Prediction Rate | 90.77% |
| Sensitivity | 95.56% |
| Specificity | 92.12% |
| Matthews Correlation Coefficient | 0.874 |

Figure 2: Summary of the results from the Lung Cancer Classifier.
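
For reference, the summary statistics in Figure 2 follow from the four confusion-matrix counts using the standard definitions. The sketch below is a generic helper; the function name and dictionary keys are our own.

```python
import math


def diagnostic_metrics(tp, fp, tn, fn):
    """Standard summary statistics for a binary classifier."""
    total = tp + fp + tn + fn
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "positive_prediction_rate": tp / (tp + fp),
        "negative_prediction_rate": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "mcc": (tp * tn - fp * fn) / mcc_den,
    }


# Counts as reported for the lung cancer classifier.
lung = diagnostic_metrics(tp=129, fp=5, tn=59, fn=6)
```

With these counts the helper reproduces the reported sensitivity (95.56%), negative prediction rate (90.77%) and Matthews correlation coefficient (0.874) to the stated precision.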
4 Melanoma


Artificial Neural Networks along with image processing and image analysis 
algorithms were used in this part of the study. 

4.1 Image pre-processing, processing and segmentation 

Noise was removed using morphological closing. An accurate lesion border
was then determined using the gradient function.

4.2 Feature Selection 

Features such as asymmetry along the minor axis, asymmetry along the
major axis, border irregularity, entropy, color variation, diameter, etc.
were extracted using built-in or custom-coded functions. The asymmetry of
the lesion was determined by overlapping the halves of the images along
their major and minor axes. Built-in functions were used to determine the
border accurately.
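
The overlap-based asymmetry measure can be illustrated with a small sketch. It assumes the lesion has already been segmented into a binary mask and rotated so that its major axis is horizontal and centred in the frame; the scoring rule (fraction of lesion pixels with no mirror partner) and the sample mask are assumptions, since the exact formula used in the study is not given.

```python
def asymmetry_score(mask, axis):
    """Fraction of lesion pixels without a mirror partner across the axis.

    mask is a list of rows of 0/1 values; axis is "horizontal" (fold top
    onto bottom) or "vertical" (fold left onto right).
    """
    if axis == "horizontal":
        mirrored = mask[::-1]
    else:
        mirrored = [row[::-1] for row in mask]
    lesion = sum(map(sum, mask))
    mismatched = sum(
        1
        for row, mrow in zip(mask, mirrored)
        for a, b in zip(row, mrow)
        if a == 1 and b == 0
    )
    return mismatched / lesion


# Hypothetical 4x4 lesion mask (1 = lesion pixel).
mask = [[0, 1, 1, 0],
        [1, 1, 1, 1],
        [1, 1, 1, 0],
        [0, 1, 0, 0]]
across_major = asymmetry_score(mask, "horizontal")
across_minor = asymmetry_score(mask, "vertical")
```

A perfectly symmetric lesion scores 0 on both axes, and larger values indicate greater asymmetry.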


4.3 Artificial Neural Network Development 

The neural network was developed using an in-built toolbox in MATLAB. 
The features extracted were used as input in the ANN. 



Figure 3: An example of the structure of an Artificial Neural Network with multiple 
layers. 

4.4 Mobile Application 

The ANN was converted into a standalone service and deployed on the 
web. It was then integrated into the mobile application. 

The Melanomore application is currently under development and further 
details will be released shortly. 
Application (sends input to the server) -> Service Proxy (sends the
information to the server) -> Melanomore Service (initiates the
diagnostic procedures)



Figure 4: The flowchart of the processes of the application. 


4.5 Results 

The ANN employed in the prototype of the application had a promising
accuracy of 92%. For training and validation purposes, over 450 images
were used (400 from the PH² dataset and 56 images sourced from other
publicly available datasets). Throughout the training period, the accuracy
increased gradually from an initial 85% with 250 images to 93% with all
450 images. The false positive, false negative and inconclusive rates
were also low.

A 70:30 split was used for training and testing of our algorithm.
The data presented here are from the testing phase of the algorithm.

As the training set size increased, the inconclusive diagnosis rate
decreased as well.
When users click on Get Results after uploading the image, they are
redirected to a page where they can see their complete report. The report
is automatically uploaded to the database and, should the user need to
submit the report to their dermatologist, they can do so with built-in
functionality.


Figure 5: The mobile application for diagnosis of melanoma. 

A t-test was performed to verify that the values calculated for asymmetry
and border actually differ between the two cases, malignant and benign.
The results of the t-test showed that the two groups were significantly
different: the probability of both belonging to the same group was
extremely low.
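
A two-sample comparison of this kind can be sketched as follows. The study does not state which variant of the t-test was used, so Welch's unequal-variance form is assumed here, and the feature values are hypothetical numbers chosen only to exercise the function.

```python
import math


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / na
    mean_b = sum(sample_b) / nb
    # Unbiased sample variances.
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (nb - 1)
    se_a, se_b = var_a / na, var_b / nb
    t = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (na - 1) + se_b ** 2 / (nb - 1))
    return t, dof


# Hypothetical asymmetry scores, for illustration only.
malignant = [0.42, 0.51, 0.47, 0.55, 0.49]
benign = [0.12, 0.18, 0.15, 0.20, 0.10]
t, dof = welch_t(malignant, benign)
```

The resulting statistic is compared against a t distribution with `dof` degrees of freedom to obtain the probability value; that lookup is omitted to keep the sketch dependency-free.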

The results have been summarised below (Figure 8). 
[Plot: Diagnosis Accuracy by Training Sample Size; axis: Accuracy in Diagnosis]

Figure 6: Comparison of training size and the accuracy of the algorithm.


[Plot: Training Sample Set vs. Inconclusive Diagnosis Rate; x-axis: Training Set Size (250 to 450)]

Figure 7: Comparison of training size and the inconclusive diagnosis rate of the algorithm.

5 Breast Cancer 

Artificial Neural Network and Gradient Descent were used in this part of 
the study. 

5.1 Feature Selection 

The features were selected from the Wisconsin Breast Cancer Database. 


| Feature | Details |
|---|---|
| Clump Thickness | Assesses whether cells are mono- or multi-layered |
| Uniformity of Cell Size | Consistency in size of cells |
| Uniformity of Cell Shape | Estimation of equality of cell shapes and identification of marginal variances |
| Marginal Adhesion | Quantifies the number of cells on the outside of the epithelium that stick together |
| Single Epithelial Cell Size | Determines whether epithelial cells are significantly enlarged |
| Bare Nuclei | Quantifies the proportion of cells that are not surrounded by cytoplasm to those that are |
| Bland Chromatin | Rates the uniformity of texture of the nucleus |
| Normal Nucleoli | Determines whether nucleoli are small or large, barely visible or more visible |
| Mitoses | The level of mitotic activity |


5.2 Gradient Descent Development 

A normalized gradient descent was developed in Python with custom
libraries. Optimal results appeared at T = 5000 iterations and
α = 2 · 10^(-…).
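
One way to realise a gradient-descent classifier of this kind is sketched below. Two assumptions are made: "normalized" is read as standardising each feature to zero mean and unit variance, and the model is a logistic regression trained by batch gradient descent. The step size `alpha=0.1` is a placeholder and not the value used in the study; only T = 5000 is taken from the text. The sample rows are hypothetical.

```python
import math


def standardise(X):
    """Scale each feature column to zero mean and unit variance."""
    cols = list(zip(*X))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_descent(X, y, alpha, T):
    """Batch gradient descent on the logistic loss; returns weights and bias."""
    w, b = [0.0] * len(X[0]), 0.0
    n = len(X)
    for _ in range(T):
        errors = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
                  for xi, yi in zip(X, y)]
        w = [wj - alpha * sum(e * xi[j] for e, xi in zip(errors, X)) / n
             for j, wj in enumerate(w)]
        b -= alpha * sum(errors) / n
    return w, b


# Hypothetical rows of (clump thickness, uniformity of cell size); 1 = malignant.
X = standardise([[8, 7], [10, 9], [7, 8], [2, 1], [1, 1], [3, 2]])
y = [1, 1, 1, 0, 0, 0]
w, b = gradient_descent(X, y, alpha=0.1, T=5000)
predictions = [int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5)
               for xi in X]
```

Standardising the inputs keeps all features on a comparable scale, which lets a single step size work across them.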

5.3 Artificial Neural Network Development 

A custom made ANN consisting of 7 layers was coded in Octave to process 
the dataset and achieve optimal results [25]. 

5.4 Results 

Both of the algorithms designed for diagnosing breast cancer delivered a
high level of accuracy. The accuracy of the ANN rose from 87% to around
95% as the training size increased. The gradient descent delivered an
accuracy of 91% after 5000 iterations. The false positive and false
negative rates in this case were low too (Figure 9, Figure 10).

A 55:45 split was used for training and testing of our algorithm.
The data presented here are from the testing phase of the algorithm.
| Overall Results | Value |
|---|---|
| True positives (TP) | 234 |
| False positives (FP) | 15 |
| True negatives (TN) | 157 |
| False negatives (FN) | 17 |
| Inconclusive | 6 |
| Positive Prediction Rate | 93.98% |
| Negative Prediction Rate | 90.23% |
| Sensitivity | 93.23% |
| Specificity | 91.28% |
| Matthews Correlation Coefficient | 0.844 |

Figure 8: Summary of the results from the Melanoma classifier.
Figure 9: Comparison of the training data size and the accuracy of the algorithm.

The performance of the algorithms has been summarised (Figure 11).
[Plot: Accuracy using Normalized Gradient Descent]

Figure 10: Comparison of the number of iterations and the accuracy of the algorithm.


| Overall Results | Value |
|---|---|
| True positives (TP) | 91 |
| False positives (FP) | 5 |
| True negatives (TN) | 155 |
| False negatives (FN) | 9 |
| Inconclusive | 0 |
| Positive Prediction Rate | 94.8% |
| Negative Prediction Rate | 94.4% |
| Sensitivity | 91.0% |
| Specificity | 96.8% |
| Matthews Correlation Coefficient | 0.886 |

Figure 11: Summary of the results from the Breast Cancer Classifier.


6 Conclusion 

The algorithms achieved a high accuracy. The ANN used for detecting
melanoma achieved an accuracy of 93%. The ANN and the Gradient Descent
used for detecting breast cancer achieved accuracies of 95% and 91%
respectively, and the SVM used for detecting lung cancer achieved an
accuracy of 94%. These algorithms also achieved very low false positive
and false negative rates, indicating that our experiments were a success.
All the algorithms used in this study were self-developed. The
ever-learning nature of the algorithms makes it possible for them to
achieve near perfect accuracy as the training size increases, something
that a proposed online network would facilitate.

Image processing and image analysis were used to obtain data directly
from the CT scans and the skin images in the cases of lung cancer and
melanoma. We decided to use the WBC dataset because it has historically
proven to be reliable. Images of the melanoma skin lesions as well
as CT scans were taken from online databases (primarily the PH² dataset
and the NCBI cancer imaging archive). We feel that the image processing
algorithms gave us an advantage by allowing us to obtain data directly
from the images. The success of our experiments is also due to our
selection of features and of methods for their calculation, which were
based on our exhaustive research of past literature.

Although previous work has succeeded in building systems for
computer-aided diagnosis (CAD) of diseases, CAD of cancer has only
recently gained traction. The results achieved in this study are
comparable to the seminal works in this field. In addition to the
well-performing algorithms, image processing and analysis algorithms and
feature selections, the construction of a naive mobile application in the
case of melanoma diagnosis is something that people will greatly benefit
from. The mobile application allows people to directly access and take
diagnostic tests on their mobile phones, hence eliminating costs
associated with logistics and consultation. This is a boon in a world
where consultation fees are sky-rocketing. In addition to this, a mobile
application makes it feasible to carry out reliable tests in
resource-poor areas where taking exhaustive tests is not at all possible.

On observation, the algorithms used in this study also outperform
several commercial software packages such as SkinSeg. With virtually no
commercial software available for lung cancer and breast cancer diagnosis,
we further plan to develop novel software that can be readily implemented
in clinics all around the world. This will facilitate a global
collaborative organisation that works on diagnosis research and also uses
our software to actually diagnose diseases.

In the future, we plan to develop diagnostic algorithms for other diseases.
We are also working on the online diagnosis software for collaborative
research.